This is one page of the R Handbook for Epidemiologists, but is being printed as a stand-alone page.

You can find the complete handbook on Github

Visualizing phylogenetic trees

Overview

Overview

Phylogenetic trees are used to visualize and describe the relatedness and evolution of organisms based on the sequence of their genetic code. They can be constructed from genetic sequences using distance-based methods (such as neighbor-joining method) or character-based methods (such as maximum likelihood and Bayesian Markov Chain Monte Carlo method). Next-generation sequencing (NGS) has become more affordable and is becoming more widely used in public health to describe pathogens causing infectious diseases. Portable devices decrease the turn around time and make data available for the support of outbreak investigation in real-time. NGS data can be used to identify the origin or source of an outbreak strain and its propagation, as well as determine presence of antimicrobial resistance genes. To visualize the genetic relatedness between samples a phylogenetic tree is constructed. In this page we will learn how to use the ggtree() package, which allows for combination of phylogenetic trees with additional sample data in form of a dataframe in order to help observe patterns and improve understanding of the outbreak dynamic.

Preparation

Preparation

This code chunk shows the loading of required packages:

# First we load the pacman package:
library(pacman)

# This allows us to load multiple packages at the same time in one line of code:
pacman::p_load(here, ggplot2, dplyr, ape, ggtree, treeio, ggnewscale)

There are several different formats in which a phylogenetic tree can be stored (eg. Newick, NEXUS, Phylip). A common one, which we will also use here in this example is the Newick file format (.nwk), which is the standard for representing trees in computer-readable form. Which means, an entire tree can be expressed in a string format such as “((t2:0.04,t1:0.34):0.89,(t5:0.37,(t4:0.03,t3:0.67):0.9):0.59);” listing all nodes and tips and their relationship (branch length) to each other.

It is important to understand that the phylogenetic tree file in itself does not contain sequencing data, but is merely the result of the distances between the sequences. We therefore cannot extract sequencing data from a tree file.

We use the ape() package to import a phylogenetic tree file and store it in a list object of class “phylo”. We inspect our tree object and see it contains 299 tips (or samples) and 236 nodes.


# read in the tree: we use the here package to specify the location of our R project and data files:
tree <- ape::read.tree(here::here("data", "Shigella_tree.nwk"))

# inspect the tree file:
tree
## 
## Phylogenetic tree with 299 tips and 236 internal nodes.
## 
## Tip labels:
##   SRR5006072, SRR4192106, S18BD07865, S18BD00489, S17BD08906, S17BD05939, ...
## Node labels:
##   17, 29, 100, 67, 100, 100, ...
## 
## Rooted; includes branch lengths.

Second we import a table with additional information for each sequenced sample such as gender, country of origine and attributes for antimicrobial resistance:


# We read in a csv file into a dataframe format:
sample_data <- read.csv(here::here("data","sample_data_Shigella_tree.csv"),sep=",", na.strings=c("NA"), head = TRUE, stringsAsFactors=F)

We clean and inspect our data: In order to assign the correct sample data to the phylogenetic tree, the Sample_IDs in the sample_data file need to match the tip.labels in the tree file:


# We clean the data: we select certain columns to be protected from cleaning in order to main tain their formating (eg. for the sample names, as they have to match the names in the phylogenetic tree file)
#sample_data <- linelist::clean_data(sample_data, protect = c(1, 3:5)) 

# We check the formatting of the tip labels in the tree file: 

head(tree$tip.label) # these are the sample names in the tree - we inspect the first 6 with head()
## [1] "SRR5006072" "SRR4192106" "S18BD07865" "S18BD00489" "S17BD08906"
## [6] "S17BD05939"

# We make sure the first column in our dataframe are the Sample_IDs:
colnames(sample_data)   
##  [1] "Sample_ID"                  "serotype"                  
##  [3] "Country"                    "Continent"                 
##  [5] "Travel_history"             "Year"                      
##  [7] "Patient_age"                "Source"                    
##  [9] "Gender"                     "gyrA_mutations"            
## [11] "macrolide_resistance_genes" "ESBL"                      
## [13] "MIC_AZM"                    "MIC_CIP"

# We look at the sample_IDs in the dataframe to make sure the formatting is the same than in the tip.labels (eg. letters are all capital, no extra _ between letters and numbers etc.)
head(sample_data$Sample_ID) # we inspect only the first 6 using head()
## [1] "ERR025692" "ERR025682" "ERR025714" "ERR025713" "ERR025709" "ERR025711"

Upon inspection we can see that the format of sample_ID in the dataframe corresponds to the format of sample names at the tree tips. These do not have to be sorted in the same order to be matched.

We are ready to go!

Simple tree visualization

Simple tree visualization

Different tree layouts:

ggtree() offers many different layout formats and some may be more suitable for your specific purpose than others:

# Examples:
ggtree(tree) # most simple linear tree
ggtree(tree,  branch.length = "none") # most simple linear tree with all tips aligned
ggtree(tree, layout="circular") # most simple circular tree
ggtree(tree, layout="circular", branch.length = "none") # most simple circular tree with all tips aligned

# for other options see online: http://yulab-smu.top/treedata-book/chapter4.html

Simple tree with addition of sample data:

The most easy annotation of your tree is the addition of the sample names at the tips, as well as coloring of tip points and if desired branches:


# A: Plot Circular tree:
ggtree(tree, layout="circular", branch.length='none') %<+% sample_data + # the %<+% is used to add your dataframe with sample data to the tree
  aes(color=I(Source))+ # color the branches according to a variable in your dataframe
  scale_color_manual(name = "Sample Origin", # name of your color scheme (will show up in the legend like this)
                     breaks = c("NRC BEL", "NA"), # the different options in your variable
                     labels = c("NRCSS Belgium", ""), # how you want the different options named in your legend, allows for formatting
                     values= c("blue"), # the color you want to assign to the variable if its "nrc_bel"
                     na.value="grey")+ # for the NA values we choose the color grey
  new_scale_color()+ # allows to add an additional color scheme for another variable
     geom_tippoint(aes(color=Continent), size=1.5)+ # color the tip point by continent, you may change shape adding "shape = "
scale_color_brewer(name = "Continent",  # name of your color scheme (will show up in the legend like this)
                       palette="Set1", # we choose a premade set of colors coming with the brewer package
                   na.value="grey")+ # for the NA values we choose the color grey
  geom_tiplab(color='black', offset = 1, size = 1, geom = "text" , align=TRUE)+ # add the name of the sample to the tip of its branch (you can add as many text lines as you like with the + , you just need to change the offset value to place them next to each other)
  ggtitle("Phylogenetic tree of Shigella sonnei")+ # title of your graph
  theme(axis.title.x=element_blank(), # removes x-axis title
      axis.title.y=element_blank(), # removes y-axis title
     legend.title=element_text(face="bold", size =12), # defines font size and format of the legend title
       legend.text=element_text(face="bold", size =10), # defines font size and format of the legend text
      plot.title = element_text(size =12, face="bold"),  # defines font size and format of the plot title
     legend.position="bottom", # defines placement of the legend
        legend.box="vertical", legend.margin=margin()) # defines placement of the legend


# Export your tree graph:
ggsave(here::here("images", "example_tree_circular_1.png"), width = 12, height = 14)

Manipulation of phylogenetic trees

Manipulation of phylogenetic trees

Sometimes you may have a very large phylogenetic tree and you are only interested in one part of the tree. For example if you produced a tree including historical or international samples to get a large overview of where your dataset might fit in in the bigger picture. But then to look closer at your data you want to inspect only that portion of the bigger tree.

Since the phylogenetic tree file is just the output of sequencing data analysis, we can not manipulate the order of the nodes and branches in the file itself. These have already been determined in previous analysis from the raw NGS data. We are able though to zoom into parts, hide parts and seven subset part of the tree.

Zooming in on one part of the tree:

If you don’t want to “cut” your tree, but only inspect part of it more closely you can zoom in to view a specific part:


# First we plot the whole tree:
p <- ggtree(tree,) %<+% sample_data +
  geom_tiplab(size =1.5) + # labels the tips of all branche with the sample name in the tree file
  geom_text2(aes(subset=!isTip, label=node), size =5, color = "darkred", hjust=1, vjust =1) # labels all the nodes in the tree
p

We want to zoom into the branch which is sticking out, after node number 452 to get a closer look:


viewClade(p , node=452)

Collapsing one part of the tree:

The other way around we may want to ignore this branch which is sticking out and can do so by collapsing it at the node (indicated here by the blue square):

#First we collapse at node 452
p_collapsed <- collapse(p, node=452)

#To not forget that we collapsed this node we assign a symbol to it:
p_collapsed + geom_point2(aes(subset=(node == 452)), size=5, shape=23, fill="steelblue")

Subsetting a tree:

If we want to make a more permanent change and create a new tree to work with we can subset part of it and even save it as new newick tree file.


# To do so you can add the node and tip labels to your tree to see which part you want to subset:
ggtree(tree, branch.length='none', layout='circular') %<+% sample_data +
  geom_tiplab(size =1) + # labels the tips of all branche with the sample name in the tree file
  geom_text2(aes(subset=!isTip, label=node), size =3, color = "darkred") +# labels all the nodes in the tree
 theme(legend.position = "none", # removes the legend all together
 axis.title.x=element_blank(),
      axis.title.y=element_blank(),
      plot.title = element_text(size =12, face="bold"))

# A: Subset tree based on node:
sub_tree1 <- tree_subset(tree, node = 528) # we subset the tree at node 528
# lets have a look at the subset tree:
ggtree(sub_tree1)+  geom_tiplab(size =3) +
  ggtitle("Subset tree 1")

# B: Subset the same part of the tree based on a samplem in this case S17BD07692:
sub_tree2 <- tree_subset(tree,"S17BD07692", levels_back = 9) # levels back defines how many nodes backwards from the sample tip you want to go
# lets have a look at the subset tree:
ggtree(sub_tree2)+  geom_tiplab(size =3)  +
  ggtitle("Subset tree 2")

You can also save your new tree as a Newick file:


ape::write.tree(sub_tree2, file="data/Shigella_subtree_2.nwk")
## Error in file(file, ifelse(append, "a", "w")): cannot open the connection

Rotating nodes in a tree:

As mentioned before we cannot change the order of tips or nodes in the tree, as this is based on their genetic relatedness and is not subject to visual manipulation. But we can rote branches around nodes if that eases our visualization.

First we plot our new subsetted tree with nodelabels to choose the node we want to manipulate:


p <- ggtree(sub_tree2) +  geom_tiplab(size =4) +
  geom_text2(aes(subset=!isTip, label=node), size =5, color = "darkred", hjust =1, vjust =1) # labels all the nodes in the tree
p

We choose to manipulate node number 39: we do so by applying ggtree::rotate() or ggtree::fluip() indirectly to node 36 so node 39 moves to the bottom and nodes 37 and 38 move to the top:


p1 <- p + geom_hilight(39, "steelblue", extend =0.0015)+ # highlights the node 39 in blue
   geom_hilight(37, "yellow", extend =0.0015)  + # highlights the node 37 in yellow
  ggtitle("Original tree") 

# we want to rotate node 36 so node 39 is on the bottom and nodes 37 and 38 move to the top:

rotate(p1, 39) %>% rotate(37)+
  ggtitle("Rotated Node 36")

#or we can use the flip command to achieve the same thing:
flip(p1, 39, 37)

Example subtree with sample data annotation:

Lets say we are investigating the cluster of cases with clonal expansion which occured in 2017 and 2018 at node 39 in our sub-tree. We add the year of strain isolation as well as travel history and color by country to see origin of other closely related strains:


# Add sample data:
ggtree(sub_tree2) %<+% sample_data + 
   geom_tiplab(size =2.5, offset = 0.001, align = TRUE) + # labels the tips of all branche with the sample name in the tree file
  theme_tree2()+
  xlab("genetic distance")+ # add a label to the x-azis
  xlim(0, 0.015)+ # set the x-axis limits of our tree
  geom_tippoint(aes(color=Country), size=1.5)+ # color the tip point by continent
  scale_color_brewer(name = "Country", 
                       palette="Set1", 
                     na.value="grey")+
    geom_tiplab(aes(label = Year), color='blue', offset = 0.0045, size = 3, linetype = "blank" , geom = "text" , align=TRUE)+ # add isolation year
    geom_tiplab(aes(label = Travel_history), color='red', offset = 0.006, size = 3, linetype = "blank" , geom = "text" , align=TRUE)+ # add travel history
  ggtitle("Phylogenetic tree of Belgian S. sonnei strains with travel history")+ # add plot title
  theme(axis.title.x=element_blank(),
      axis.title.y=element_blank(),
     legend.title=element_text(face="bold", size =12),
       legend.text=element_text(face="bold", size =10),
      plot.title = element_text(size =12, face="bold"))

Our observation points towards an import of strains from Asia, which then circulated in Belgium over the years and seem to have caused our latest outbreak.

More complex trees: adding heatmaps of sample data

More complex trees

We can add more complex information, such as categorical presence of antimicrobial resistance genes and numeric values for actually measured resistance to antimicrobials in form of a heatmap using the ggtree::gheatmap() function.

First we need to plot our tree (this can be either linear or circular): We will use the sub_stree from part 3.)

# A: Circular tree:
p <- ggtree(sub_tree2, branch.length='none', layout='circular') %<+% sample_data +
  geom_tiplab(size =3) + 
 theme(legend.position = "none",
 axis.title.x=element_blank(),
      axis.title.y=element_blank(),
      plot.title = element_text(size =12, face="bold",hjust = 0.5, vjust = -15))
p

Second we prepare our data. To visualize different variables with new color schemes, we subset our dataframe to the desired variable.

For example we want to look at gender and mutations that could confer resistance to ciprofloxacin:


# Create your gender dataframe:
gender <- data.frame("gender" = sample_data[,c("Gender")])
# Its important to add the Sample_ID as rownames otherwise it cannot match the data to the tree tip.labels:
rownames(gender) <- sample_data$Sample_ID

# Create your ciprofloxacin dataframe based on mutations in the gyrA gene:
cipR <- data.frame("cipR" = sample_data[,c("gyrA_mutations")])
rownames(cipR) <- sample_data$Sample_ID

# Create your ciprofloxacin dataframe based on the measured minimum inhibitory concentration (MIC) from the laboratory:
MIC_Cip <- data.frame("mic_cip" = sample_data[,c("MIC_CIP")])
rownames(MIC_Cip) <- sample_data$Sample_ID

We create a first plot adding a binary heatmap for gender to the phylogenetic tree:


# First we add gender:
h1 <-  gheatmap(p, gender, offset = 10, width=0.10, color=NULL, # offset shifts the heatmap to the right, width defines the width of the heatmap column, color defines the boarder of the heatmap columns
         colnames = FALSE)+ # hides column names for the heatmap
  scale_fill_manual(name = "Gender", # define the coloring scheme and legend for gender
                    values = c("#00d1b1", "purple"),
                    breaks = c("Male", "Female"),
                    labels = c("Male", "Female"))+
   theme(legend.position="bottom",
        legend.title = element_text(size=12),
        legend.text = element_text(size =10),
        legend.box="vertical", legend.margin=margin())
h1

Then we add information on ciprofloxacin resistance genes:


# First we assigng a new color scheme to our existing plot, this enables us to define and change the colors for our second variable
h2 <- h1 + new_scale_fill() 

# then we combine these into a new plot:
h3 <- gheatmap(h2, cipR,  offset = 12, width=0.10, # adds the second row of heatmap describing ciprofloxacin resistance genes
                colnames = FALSE)+
  scale_fill_manual(name = "Ciprofloxacin resistance \n conferring mutation",
                    values = c("#fe9698","#ea0c92"),
                    breaks = c( "gyrA D87Y", "gyrA S83L"),
                    labels = c( "gyrA d87y", "gyrA s83l"))+
   theme(legend.position="bottom",
        legend.title = element_text(size=12),
        legend.text = element_text(size =10),
        legend.box="vertical", legend.margin=margin())+
  guides(fill=guide_legend(nrow=2,byrow=TRUE))
h3

Next we add continuous data on actual resistance determined by the laboratory as the minimum inhibitory concentration (MIC) of ciprofloxacin :

# First we add the new coloring scheme:
h4 <- h3 + new_scale_fill()

# then we combine the two into a new plot:
h5 <- gheatmap(h4, MIC_Cip,  offset = 14, width=0.10,
                colnames = FALSE)+
  scale_fill_continuous(name = "MIC for ciprofloxacin",
                      low = "yellow", high = "red",
                      breaks = c(0, 0.50, 1.00),
                      na.value = "white")+
   guides(fill = guide_colourbar(barwidth = 5, barheight = 1))+
   theme(legend.position="bottom",
        legend.title = element_text(size=12),
        legend.text = element_text(size =10),
        legend.box="vertical", legend.margin=margin())
h5

We can do the same exercise for a linear tree:

# B: Lineartree:
p <- ggtree(sub_tree2) %<+% sample_data +
  geom_tiplab(size =3) + # labels the tips
  theme_tree2()+
  xlab("genetic distance")+
  xlim(0, 0.015)+
 theme(legend.position = "none",
      axis.title.y=element_blank(),
      plot.title = element_text(size =12, face="bold",hjust = 0.5, vjust = -15))


# First we add gender:

h1 <-  gheatmap(p, gender, offset = 0.003, width=0.1, color="black", 
         colnames = FALSE)+
  scale_fill_manual(name = "Gender",
                    values = c("#00d1b1", "purple"),
                    breaks = c("Male", "Female"),
                    labels = c("Male", "Female"))+
   theme(legend.position="bottom",
        legend.title = element_text(size=12),
        legend.text = element_text(size =10),
        legend.box="vertical", legend.margin=margin())
# h1

# Then we add ciprofloxacin after adding another colorscheme layer:

h2 <- h1 + new_scale_fill()
h3 <- gheatmap(h2, cipR,  offset = 0.004, width=0.1,color="black",
                colnames = FALSE)+
  scale_fill_manual(name = "Ciprofloxacin resistance \n conferring mutation",
                    values = c("#fe9698","#ea0c92"),
                    breaks = c( "gyrA D87Y", "gyrA S83L"),
                    labels = c( "gyrA d87y", "gyrA s83l"))+
   theme(legend.position="bottom",
        legend.title = element_text(size=12),
        legend.text = element_text(size =10),
        legend.box="vertical", legend.margin=margin())+
  guides(fill=guide_legend(nrow=2,byrow=TRUE))
# h3

# Then we add the minimum inhibitory concentration determined by the lab (MIC):
h4 <- h3 + new_scale_fill()
h5 <- gheatmap(h4, MIC_Cip, offset = 0.005, width=0.1, color="black", 
                colnames = FALSE)+
  scale_fill_continuous(name = "MIC for ciprofloxacin",
                      low = "yellow", high = "red",
                      breaks = c(0,0.50,1.00),
                      na.value = "white")+
   guides(fill = guide_colourbar(barwidth = 5, barheight = 1))+
   theme(legend.position="bottom",
        legend.title = element_text(size=10),
        legend.text = element_text(size =8),
        legend.box="horizontal", legend.margin=margin())+
  guides(shape = guide_legend(override.aes = list(size = 2)))
h5